perm filename JHTALK[KI,ALS] blob
sn#097066 filedate 1974-04-14 generic text, type T, neo UTF8
00100 The Stanford AI Pitch-Synchronous Fourier-Transform Formant Extractor
00200
00300 The formant extractor is not a formant tracker in the usual sense since
00400 a fresh determination of the formant locations is made for each segment
00500 independently. This is thought to be desirable as it reveals
00600 rapid changes in formant location, particularly in the vicinity of
00700 obstruents where the character of the obstruent is frequently revealed
00800 more by these rapid transitions than by anything else. Only after this
00900 has been done is any attempt made to reconcile data for adjacent
01000 segments, as will be explained later.
01100
01200 Formant identification is based on the use of Fourier transforms using
01300 single pitch period segments where the segment starts and ends at the
01400 zero crossing which precedes the maximum excursion in amplitude.
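
The choice of starting point can be illustrated with a minimal sketch. It assumes the samples of one search window are held in a Python list; the name segment_start is hypothetical and is not taken from the original program. Starting at the sample of largest absolute amplitude, it walks backward to the first sample after the preceding sign change.

```python
def segment_start(samples):
    """Index of the first sample after the zero crossing that precedes
    the maximum excursion in `samples` (a non-empty list of numbers)."""
    # Location of the largest absolute amplitude in the window.
    peak = max(range(len(samples)), key=lambda i: abs(samples[i]))
    positive = samples[peak] > 0
    # Walk backward while the waveform keeps the sign it has at the peak.
    i = peak
    while i > 0 and samples[i - 1] != 0 and (samples[i - 1] > 0) == positive:
        i -= 1
    return i
```

A segment would then run from this index to the corresponding index found in the following pitch period.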
01500
01600 A study has been made of the effects of the segment location within the
01700 period and of the effect of the segment length. In general, cleaner
01800 transforms are produced when the segment length is something less than
01900 the full period; 80% seems to be a reasonable compromise between
02000 cleanness and unwarranted broadening of the peaks in the spectrum because
02100 of insufficient points of data. However, it is questioned whether this
02200 is a reasonable thing to do since the location of the formant peaks is
02300 affected by the glottal loading during the latter part of the period
02400 and this is, of course, removed. It seems more reasonable to assume that
02500 the speaker modifies the shape of his upper vocal tract to compensate
02600 for his own peculiar glottal loading effects since he attempts to produce
02700 sounds that match those produced by others and it is highly unlikely
02800 that the ear can do anything to disambiguate glottal coupling effects.
02900 It is observed that this glottal loading effect is more pronounced
03000 for pitch periods that happen to be longer than the average.
03100 To all appearances it seems that most speakers delay
03200 the closing of the glottis rather than lengthening the closed time
03300 when they drop the pitch of their voice. A reasonable thing
03400 to do thus seems to be to use the full period for intervals
03500 that are normal or shorter and to restrict the length to the average
03600 length for long periods.
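
The length rule just stated amounts to clipping each period at the average. A minimal sketch follows, assuming period lengths are given in samples and that "average" means the mean over the periods supplied (the text does not say how the average is formed); segment_lengths is a hypothetical name.

```python
def segment_lengths(periods):
    """Transform length for each pitch period: the full period when it is
    normal or shorter, the average length when it is longer."""
    # Assumption: the average is the mean over the periods supplied.
    average = round(sum(periods) / len(periods))
    return [min(p, average) for p in periods]
```

For example, periods of 100, 100, 130 and 90 samples have an average of 105, so only the 130 sample period is shortened.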
03700
03800 The location of the formant peaks
03900 is also shifted somewhat by shifts in the starting point in the period
04000 since windowing attenuates contributions to the transform from the
04100 edge portions of the data but this effect is small as compared with
04200 the increase in ease with which the peaks can be located for the
04300 starting location as mentioned.
04400
04500 The first operation is to locate the largest proper peaks found in
04600 each of six regions, these being the usual ranges for the first five
04700 formants and the region below the usual lower limit for the first
04800 formant. These limits are shifted between male and female voices, but
04900 in general we have not found it necessary to adjust them for the
05000 specific speaker. A proper peak is defined as the largest local maximum
05100 in the region that is bounded on both sides by points
05200 that are of lesser amplitude. If the five points for the five formant
05300 regions are distinct, that is, no two are assigned the
05400 same value, the points are accepted as is, subject to a final
05500 medial smoothing operation which will be described later.
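
The proper-peak search and the distinctness test can be sketched as follows. The spectrum is assumed to be a list of amplitudes indexed by transform bin, and a region is a half-open range of bins; both function names are hypothetical. Whether the bounding points may lie just outside the region is not stated in the text, and this sketch allows it.

```python
def largest_proper_peak(spectrum, lo, hi):
    """Bin of the largest local maximum in spectrum[lo:hi] that has a
    lower point on each side, or None if the region holds no such peak."""
    best = None
    for i in range(max(lo, 1), min(hi, len(spectrum) - 1)):
        # A proper peak is strictly higher than both of its neighbors.
        if spectrum[i - 1] < spectrum[i] > spectrum[i + 1]:
            if best is None or spectrum[i] > spectrum[best]:
                best = i
    return best


def all_distinct(picks):
    """True when no two formant regions were assigned the same bin."""
    return len(set(picks)) == len(picks)
```

When all_distinct is false for the five formant picks, the conflict procedures that follow come into play.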
05600
05700 Since the ranges for the formants overlap, frequent conflicts occur
05800 and these must now be resolved. This is done starting at the low
05900 frequency end. Somewhat different strategies are used for different
06000 possible conflicts.
06100
06200 Should the first and second formant identifications
06300 conflict, then searches are made for the next largest proper peaks: to the
06400 low frequency side, extending the region to zero, and to the high
06500 frequency side to the upper limit of the F2 band. The amplitudes of these
06600 two new peaks and their positions with respect to
06700 median values for the F1 and F2 regions are then compared. Actually
06800 a decision made on the basis of amplitude only, allowing a 6 db credit
06900 for the higher frequency peak, seems to make the right decision almost
07000 always. A study will be made of this matter when a larger sample of data
07100 becomes available.
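
One reading of the amplitude-only decision is sketched below; it is an interpretation, not the original code. It assumes amplitudes in db indexed by bin, that the disputed peak becomes F1 when the high-side candidate wins and F2 when the low-side candidate wins, and that a missing candidate is passed as None. The names resolve_f1_f2 and CREDIT_DB are hypothetical.

```python
CREDIT_DB = 6.0  # credit allowed to the higher frequency peak


def resolve_f1_f2(disputed, low_peak, high_peak, amp_db):
    """Return (f1, f2) bins for a disputed peak claimed by both F1 and F2.

    low_peak and high_peak are the next largest proper peaks found below
    and above the disputed peak (None when no such peak exists).
    Returns None when neither candidate exists.
    """
    if low_peak is None and high_peak is None:
        return None  # unresolved; left to the caller
    low_amp = amp_db[low_peak] if low_peak is not None else float("-inf")
    high_amp = amp_db[high_peak] if high_peak is not None else float("-inf")
    # Amplitude-only decision with a 6 db credit for the higher peak.
    if high_amp + CREDIT_DB >= low_amp:
        return disputed, high_peak
    return low_peak, disputed
```

The comparison with median positions for the F1 and F2 regions mentioned in the text is left out, since the amplitude rule is reported to suffice almost always.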
07200
07300 Having resolved the conflict between F1 and F2, attention is then
07400 directed to a possible conflict between F2 and F3 which may have been
07500 introduced by the resolution of the F1-F2 conflict or which may have been
07600 there initially. If a conflict is newly introduced then a second look
07700 is given to the F1-F2 conflict. Recourse is now made to a procedure
07800 to locate a possible F2 peak that had been obscured by a dominant
07900 F1 peak. The approximate shape of the original F1-F2 peak is assumed
08000 to be parabolic, as determined from three data points, these being the
08100 point at the maximum and the points nearest the two 3 db down values.
08200 A fresh attempt is made to locate a new peak between the location of the
08300 disputed peak, which is now extracted out from the data, and the location
08400 previously found for F3. If such a peak is found it is assigned to
08500 F2 and attention is shifted to a possible F3-F4 conflict.
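
The extraction of the dominant peak might be sketched as below. Several details are assumptions, since the text does not give them: the spectrum is taken in linear amplitude, the 3 db down points are taken as the first bins at or below that level on each side (rather than strictly the nearest), the parabola is set to zero where it would go negative, and the peak is assumed to be an interior bin strictly higher than the bins chosen on either side. Both function names are hypothetical.

```python
def fit_parabola(p0, p1, p2):
    """Coefficients (a, b, c) of y = a*x*x + b*x + c through three
    (x, y) points with distinct x values (Lagrange form)."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    d0 = (x0 - x1) * (x0 - x2)
    d1 = (x1 - x0) * (x1 - x2)
    d2 = (x2 - x0) * (x2 - x1)
    a = y0 / d0 + y1 / d1 + y2 / d2
    b = -(y0 * (x1 + x2) / d0 + y1 * (x0 + x2) / d1 + y2 * (x0 + x1) / d2)
    c = y0 * x1 * x2 / d0 + y1 * x0 * x2 / d1 + y2 * x0 * x1 / d2
    return a, b, c


def subtract_parabolic_peak(amp, peak):
    """Remove a parabolic model of the peak at bin `peak` from a linear
    amplitude spectrum and return the residual spectrum."""
    target = amp[peak] * 10 ** (-3.0 / 20.0)  # 3 db below the maximum
    left = peak - 1
    while left > 0 and amp[left] > target:
        left -= 1
    right = peak + 1
    while right < len(amp) - 1 and amp[right] > target:
        right += 1
    a, b, c = fit_parabola((left, amp[left]), (peak, amp[peak]),
                           (right, amp[right]))
    residual = []
    for i, y in enumerate(amp):
        model = max(a * i * i + b * i + c, 0.0)  # no negative model values
        residual.append(max(y - model, 0.0))
    return residual
```

A shoulder on the flank of the dominant peak, which is not a proper peak in the original data, can appear as one in the residual and so become a candidate for F2.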
08600
08700 Should an initial conflict be found between F2 and F3, this is resolved
08800 in essentially the same way except that no attempt is made to find
08900 a possible hidden F3 as was done for F2. Instead, if a conflict between
09000 F4 and F5 is produced by the resolution of an F3-F4 conflict then this
09100 is resolved just as if it were an initial conflict.
09200
09300
09400 Under certain circumstances it seems to be impossible to resolve all
09500 conflicts by the procedures just described. When this occurs the failure
09600 to locate a proper peak is signaled by storing a zero for the formant in
09700 question and the program proceeds to the next formant. On the completion
09800 of this first go-around a second look is given to any zero values, and
09900 finally, if still unresolved, the zeros are replaced by the value found
10000 for the formant in question in the previous time slot.
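
The final replacement of unresolved zeros can be sketched for one formant track as follows, assuming the track is a list with one value per time slot and zero as the failure marker. The name fill_unresolved is hypothetical, and the treatment of a zero in the very first slot (left as zero here) is an assumption.

```python
def fill_unresolved(track):
    """Replace each zero in a formant track by the value held for the
    previous time slot, so that runs of zeros repeat the last good value."""
    filled = []
    for value in track:
        if value == 0 and filled:
            value = filled[-1]  # carry the previous slot forward
        filled.append(value)
    return filled
```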
10100
10200 Having resolved all conflicts in this way, then the exact locations for
10300 peaks are refined by parabolic interpolations based on the positions
10400 of the highest point and its two nearest neighbors. It is doubtful
10500 whether the greater precision which results from this operation is at all
10600 needed, at least in the case of 512 point transforms on 20,000 hertz
10700 data. Nevertheless, at least 2 bits of added precision can be obtained and
10800 the greatly improved smoothness of the resulting formant tracks seems
10900 to indicate that a corresponding increase in accuracy has resulted.
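
The refinement can be sketched with the usual three-point parabolic formula. The function names are hypothetical, and the conversion to hertz assumes that the bin spacing is the sampling rate divided by the transform length, that is 20000/512 or about 39 hertz per bin.

```python
def refine_peak(spectrum, k):
    """Fractional bin position of the vertex of the parabola through
    bin k and its two nearest neighbors (k must be an interior bin)."""
    left, mid, right = spectrum[k - 1], spectrum[k], spectrum[k + 1]
    denom = left - 2.0 * mid + right
    if denom == 0:
        return float(k)  # the three points are collinear; no refinement
    return k + 0.5 * (left - right) / denom


def bin_to_hz(position, sample_rate=20000.0, n_points=512):
    """Convert a (possibly fractional) bin position to hertz."""
    return position * sample_rate / n_points
```

On these assumptions, 2 bits of added precision correspond to a quarter of a bin, or roughly 10 hertz.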
11000
11100 The procedures so far described result in very good formant tracks.
11200 However, there are still isolated points which appear to be out of line.
11300 Most of these appear to be situations where a person would be quite
11400 unable to make an assured decision. A certain few can be traced to
11500 failures in the pitch period determining procedure while others are
11600 due to more obscure reasons. In almost all cases these abnormalities
11700 persist for but a single pitch period and they can be corrected by
11800 a final process of medial smoothing. This is done in one direction only:
11900 going forward in time, each value for each formant is replaced by the
12000 median value of the point in question, its predecessor (as already
12100 corrected) and its successor. Individual points which lie between
12200 their neighbors are not altered by this procedure. Errant points are
12300 replaced by the value of the neighbor nearer in value. This procedure does
12400 have the effect of correcting true extrema but an extremum which persists for
12500 but a single pitch period probably does not contain much phonetic
12600 information and can probably be ignored. One could make allowances for
12700 true extrema by applying the medial smoothing only to points that
12800 lie more than, say, 2 db away from their nearest neighbor. This
12900 refinement seems entirely unnecessary but it is being kept in reserve.
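
The medial smoothing pass can be sketched for one formant track as follows, assuming a list with one value per pitch period. The name medial_smooth is hypothetical; the first and last points have only one neighbor and are left unchanged here, which is an assumption.

```python
def medial_smooth(track):
    """Forward three-point median smoothing of one formant track.

    Each interior point is replaced by the median of its predecessor
    (as already corrected), itself, and its uncorrected successor.
    """
    out = list(track)
    for i in range(1, len(out) - 1):
        out[i] = sorted((out[i - 1], out[i], out[i + 1]))[1]
    return out
```

For the track 500, 510, 900, 520, 530 the errant 900 is replaced by 520, while a track that rises or falls steadily is returned unchanged.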
13000
13100 The advantages of this method of formant extraction over other more
13200 conventional tracking procedures seem to lie in the much improved
13300 results in the vicinity of obstruents where the rapid changes in formant
13400 location can be masked by tracking and where information as to the nature
13500 of the obstruent is contained in this transition region.